Intro to the Socrata API with the NYC Dog Licensing Dataset & Python
Contents
Intro to the Socrata API with the NYC Dog Licensing Dataset & Python¶
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sodapy import Socrata
import plotly.express as px
from urllib.request import urlopen
import json
# from IPython.core.display import display, HTML
from IPython.display import IFrame
pd.options.display.max_rows = 100
What is an API?¶
The Socrata API follows the REST (REpresentational State Transfer) design pattern.
REST stands for REpresentational State Transfer. It originally had a more abstract meaning, but has come to be a shorthand name for web sites that act a bit like python functions, taking as inputs values for certain parameters and producing outputs in the form of a long text string.
API stands for Application Programming Interface. An API specifies how an external program (an application program) can request that a program perform certain computations.
Putting the two together, a REST API specifies how external programs can make HTTP requests to a web site in order to request that some computation be carried out and data returned as output. When a website is designed to accept requests generated by other computer programs, and produce outputs to be consumed by other programs, it is sometimes called a web service, as opposed to a web site which produces output meant for humans to consume in a web browser.
Anatomy of a URL with the Socrata API¶
For this demonstration we will be making requests to the Socrata API with the The NYC Dog License Dataset.
Components of the url:
headers
endpoints
parameters
API key
In a REST API, the client or application program makes an HTTP request that includes information about what kind of request it is making. Web sites are free to define whatever format they want for how the request should be formatted.
In this format, the URL has a standard structure:
the base URL: https://data.cityofnewyork.us/resource/nu7n-tubp.json
a ? character
or more key-value pairs (parameters), formatted as key=value pairs and separated by the & character
For example, consider the following url requests to the NYC Dog License Dataset with the Socrata API:
https://data.cityofnewyork.us/resource/nu7n-tubp.json?rownumber=40895
https://data.cityofnewyork.us/resource/nu7n-tubp.json?licenseissueddate=2014-09-12T00:00:00.000
https://data.cityofnewyork.us/resource/nu7n-tubp.json?animalname=PAIGE&zipcode=10035
Try copying that URL into a browser, or just clicking on it. Depending on your browser, it may put the contents into a file attachment that you have to open up to see the contents, or it may just show the contents in a browser window.
API Documentation¶
The API documentation for the NYC Dog License Dataset contains all of the parts of the url that we need. The fields in the documentation describe the parameters we can use to filter the data in the url request. It is important to read the API documentation for every dataset you use in NYC Open Data, as each dataset has unique features that need to be considered when making requests.
Application Tokens¶
Using the Socrata client to make requests¶
We are using the Python sodapy library to make our request. The client sends our request for the data, and the API sends the data to us in .json format. Then the pandas library is used to format the results into a dataframe. The Socrata().get() method is used with parameters to filter the data in our request. Check out the SoSQL examples in the sodapy github for more info. The filters use the SoSQL statements, which are based on SQL and have similar syntax.
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)
# Example authenticated client (needed for non-public datasets):
# client = Socrata('data.cityofnewyork.us',
# 'appTOKEN',
# username='username',
# password='password')
# Results returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("nu7n-tubp",
limit=5000,
where = "extract_year = '2017' AND breedname = 'Boxer'",
select = "animalname, breedname, zipcode, extract_year",
order = "zipcode"
)
# Convert to pandas DataFrame
doggy_data = pd.DataFrame.from_records(results)
doggy_data['count'] = 1
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.
doggy_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 animalname 962 non-null object
1 breedname 962 non-null object
2 zipcode 962 non-null object
3 extract_year 962 non-null object
4 count 962 non-null int64
dtypes: int64(1), object(4)
memory usage: 37.7+ KB
doggy_data.head(20)
| animalname | breedname | zipcode | extract_year | count | |
|---|---|---|---|---|---|
| 0 | LUCKY | Boxer | 10001 | 2017 | 1 |
| 1 | CLAY | Boxer | 10001 | 2017 | 1 |
| 2 | BRISCOE | Boxer | 10001 | 2017 | 1 |
| 3 | ALEXANDER | Boxer | 10001 | 2017 | 1 |
| 4 | HAZEL | Boxer | 10001 | 2017 | 1 |
| 5 | BECKEM | Boxer | 10001 | 2017 | 1 |
| 6 | BESSIE | Boxer | 10002 | 2017 | 1 |
| 7 | TALLULAH | Boxer | 10002 | 2017 | 1 |
| 8 | POOKIE | Boxer | 10002 | 2017 | 1 |
| 9 | MUGSY | Boxer | 10002 | 2017 | 1 |
| 10 | BRUTUS | Boxer | 10002 | 2017 | 1 |
| 11 | STAR | Boxer | 10002 | 2017 | 1 |
| 12 | NALA | Boxer | 10002 | 2017 | 1 |
| 13 | DEXTER | Boxer | 10002 | 2017 | 1 |
| 14 | BILLIE | Boxer | 10002 | 2017 | 1 |
| 15 | GEMMA | Boxer | 10003 | 2017 | 1 |
| 16 | SOPHIE | Boxer | 10003 | 2017 | 1 |
| 17 | TALLULAH | Boxer | 10003 | 2017 | 1 |
| 18 | RILEY | Boxer | 10003 | 2017 | 1 |
| 19 | APOLLO | Boxer | 10003 | 2017 | 1 |
doggy_data['breedname'].unique()
array(['Boxer'], dtype=object)
doggy_data['extract_year'].unique()
array(['2017'], dtype=object)
Visualize the data with plotly express¶
# create grouped dataframe by count of dog registration by zipcode
doggy_zips_grouped = doggy_data[['zipcode', 'count']].groupby(by = 'zipcode').sum().reset_index()
# use GeoJSON file from NYC opendata which contains GIS data for zip code boundaries in NYC
# web page with info on data set url here:
# https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk
with urlopen('https://data.cityofnewyork.us/resource/pri4-ifjk.geojson') as response:
zip_codes = json.load(response)
fig = px.choropleth_mapbox(doggy_zips_grouped, geojson=zip_codes, locations='zipcode', color='count',
featureidkey='properties.modzcta',
color_continuous_scale="Viridis",
range_color=(0, max(doggy_zips_grouped['count'])),
mapbox_style="carto-positron",
zoom=9.25, center = {"lat": 40.743, "lon": -73.988},
opacity=0.5,
labels={},
title="Number of Dog License Registrations in NYC by Zipcode"
).update(layout=dict(title=dict(x=0.5)))
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0})
fig.show()
# fig.write_html('export/doggy-zip-map.html')
# display(HTML('export/doggy-zip-map.html'))
# IFrame(src = 'export/doggy-zip-map.html', width=700, height=600)